Rapid Resource Transfer for Multilingual Natural Language Processing
Until recently the focus of the Natural Language Processing (NLP)
community has been on a handful of mostly European languages. However,
rapid changes in the world's economic and political climate are
precipitating a corresponding shift in the relative importance accorded
to various languages. The importance of rapidly acquiring NLP resources
and computational capabilities for new languages is widely accepted.
Statistical NLP models have a distinct advantage over rule-based methods
in achieving this goal since they require far less manual labor. However,
statistical methods require two fundamental resources for training: (1)
online corpora and (2) manual annotations. Creating these two resources
can be as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and
annotations by exploiting existing resources for well-studied languages.
Basic resources for new languages can be acquired in a rapid and
cost-effective manner by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is
converting existing printed text into electronic form using Optical
Character Recognition (OCR). Unfortunately, a language that lacks online
corpora most likely lacks OCR as well. We tackle this problem by taking an
existing OCR system that was designed for a specific language and applying
it to a language with a similar script. We present a
generative OCR model that allows us to post-process output from a
non-native OCR system to achieve accuracy close to, or better than, a
native one. Furthermore, we show that the performance of a native or
trained OCR system can be improved by the same method.
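The post-processing idea above can be illustrated with a minimal noisy-channel sketch. This is a toy illustration, not the thesis's actual model: the lexicon, probabilities, and per-character confusion model below are all assumed for demonstration.

```python
# Toy noisy-channel OCR post-correction: recover the true word t that
# maximizes P(t) * P(o | t), where o is the (possibly garbled) OCR output.
# The language model and channel model here are assumed toy values.

# Hypothetical unigram language model over the "true" vocabulary.
LM = {"cat": 0.6, "cot": 0.3, "cut": 0.1}

def channel_prob(observed: str, true: str) -> float:
    """Toy channel model: independent per-character confusion."""
    if len(observed) != len(true):
        return 1e-9  # crude penalty for length mismatch
    p = 1.0
    for o, t in zip(observed, true):
        p *= 0.9 if o == t else 0.1  # assumed error rate
    return p

def correct(observed: str) -> str:
    """Return argmax over the lexicon of P(t) * P(observed | t)."""
    return max(LM, key=lambda t: LM[t] * channel_prob(observed, t))

print(correct("cat"))  # → cat (exact match dominates)
print(correct("cot"))  # → cot (channel evidence outweighs the prior)
```

A real system would replace the unigram prior with a character or word n-gram model and learn the confusion probabilities from aligned OCR output.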
Next, we demonstrate cross-utilization of annotations on treebanks. We
present an algorithm that projects dependency trees across parallel
corpora. We also show that a reasonable quality treebank can be generated
by combining projection with a small amount of language-specific
post-processing. The projected treebank allows us to train a parser that
performs comparably to a parser trained on manually generated data.
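The core of direct projection can be sketched in a few lines. This is a simplification of the projection algorithm described above: it assumes one-to-one word alignments and silently drops arcs touching unaligned words.

```python
# Sketch of direct dependency projection across a word-aligned sentence
# pair (simplified; assumes one-to-one alignments).

def project_dependencies(src_arcs, alignment):
    """src_arcs: list of (head_idx, dep_idx) pairs in the source sentence.
    alignment: dict mapping source word index -> target word index.
    Returns the projected (head, dep) arcs in target-side indices."""
    projected = []
    for head, dep in src_arcs:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep]))
    return projected

# English "I eat rice": eat(1) heads I(0) and rice(2); the hypothetical
# target language places the object before the verb.
arcs = [(1, 0), (1, 2)]
align = {0: 0, 1: 2, 2: 1}  # assumed word alignment
proj = project_dependencies(arcs, align)
print(proj)  # → [(2, 0), (2, 1)]
```

The language-specific post-processing mentioned above would then repair arcs that direct projection gets wrong.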
Providing readout of multimedia content in messages
This disclosure describes techniques to automatically extract information from messages that include multimedia content (e.g., images or videos) and provide a readout of the message content to the recipient. Extraction of the content can be performed using machine learning. For example, readouts can be provided for multimedia messages received via short message service (SMS), chat or messaging applications, email, etc. The techniques can be implemented as part of a virtual assistant application. The readout is provided upon user request, and messages are accessed and analyzed only with the user's specific permission.
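The readout pipeline can be sketched as follows. This is a hedged illustration only: extract_caption is a placeholder for the ML-based extraction step, not an API from the disclosure.

```python
# Illustrative composition of a readout string for a multimedia message.
# extract_caption stands in for a hypothetical ML captioning model.

def extract_caption(attachment: dict) -> str:
    """Placeholder for ML-based content extraction (assumed interface)."""
    return attachment.get("caption", "an attachment")

def compose_readout(sender: str, text: str, attachments: list) -> str:
    """Build the text a virtual assistant would read aloud."""
    parts = [f"Message from {sender}: {text}"]
    for att in attachments:
        parts.append(f"It includes {att['type']} showing {extract_caption(att)}.")
    return " ".join(parts)

msg = compose_readout(
    "Alice", "Look at this!",
    [{"type": "an image", "caption": "a sunset over the bay"}],
)
print(msg)
```

In a deployed system the resulting string would be handed to a text-to-speech engine, and the whole path would be gated on the user's permission as the disclosure requires.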
Domain Tuning of Bilingual Lexicons for MT
Our overall objective is to translate a domain-specific document in a
foreign language (in this case, Chinese) to English. Using automatically
induced domain-specific, comparable documents and language-independent
clustering, we apply domain-tuning techniques to a bilingual lexicon for
downstream translation of the input document to English. We will describe
our domain-tuning technique and demonstrate its effectiveness by comparing
our results to manually constructed domain-specific vocabulary. Our
coverage/accuracy experiments indicate that domain-tuned lexicons achieve
88% precision and 66% recall. We also ran a BLEU experiment to compare our
domain-tuned version to its un-tuned counterpart in an IBM-style MT
system. Our domain-tuned lexicons brought about an improvement in the
BLEU scores: 9.4% higher than a system trained on a uniformly weighted
dictionary and 275% higher than a system trained on no dictionary at all.
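One simple form of domain tuning is to rerank a lexicon entry's translation candidates by their frequency in a domain-specific corpus. The sketch below is illustrative only (toy data, not the paper's technique or corpora).

```python
# Illustrative domain tuning: prefer translation candidates that occur
# frequently in a domain-specific comparable corpus.
from collections import Counter

def tune_lexicon(lexicon, domain_corpus_tokens):
    """lexicon: dict source_word -> list of candidate translations.
    Returns the lexicon with candidates sorted by domain frequency."""
    freq = Counter(domain_corpus_tokens)
    return {
        src: sorted(cands, key=lambda w: freq[w], reverse=True)
        for src, cands in lexicon.items()
    }

# Toy Chinese-English entry: out of domain, "shore" might rank first.
lex = {"yinhang": ["shore", "bank"]}
domain = "the bank raised rates and the bank lent money".split()
tuned = tune_lexicon(lex, domain)
print(tuned)  # → {'yinhang': ['bank', 'shore']}
```

A downstream MT system would then weight or select translations according to the tuned ranking rather than treating all dictionary entries uniformly.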
UMIACS-TR-2003-19
LAMP-TR-09
Evaluating Translational Correspondence using Annotation Projection
Recently, statistical machine translation models have begun to take
advantage of higher level linguistic structures such as syntactic
dependencies. Underlying these models is an assumption about the
directness of translational correspondence between sentences in the
two languages; however, the extent to which this assumption is valid
and useful is not well understood. In this paper, we present an
empirical study that quantifies the degree to which syntactic
dependencies are preserved when parses are projected directly from
English to Chinese. Our results show that although the direct
correspondence assumption is often too restrictive, a small set of
principled, elementary linguistic transformations can boost the
quality of the projected Chinese parses by 76% relative to the
unimproved baseline.
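The quantity being measured can be sketched as the share of gold target-language dependency arcs recovered by direct projection through word alignments. The data below is a toy example, not the paper's corpus.

```python
# Sketch of quantifying direct translational correspondence: the fraction
# of gold Chinese dependency arcs matched by arcs projected from English.

def correspondence_rate(src_arcs, alignment, gold_tgt_arcs):
    """src_arcs: (head, dep) pairs on the English side.
    alignment: dict source index -> target index (assumed one-to-one).
    gold_tgt_arcs: gold-standard (head, dep) pairs on the Chinese side."""
    projected = {
        (alignment[h], alignment[d])
        for h, d in src_arcs
        if h in alignment and d in alignment
    }
    gold = set(gold_tgt_arcs)
    return len(projected & gold) / len(gold) if gold else 0.0

arcs = [(1, 0), (1, 2), (2, 3)]
align = {0: 0, 1: 1, 2: 2, 3: 3}   # assumed monotone alignment
gold = [(1, 0), (1, 2), (3, 2)]    # one arc is reversed in the target
rate = correspondence_rate(arcs, align, gold)
print(round(rate, 3))  # → 0.667
```

Elementary transformations of the kind described above (e.g., systematically flipping certain head-dependent pairs) would raise this rate by repairing predictable divergences.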
UMIACS-TR-2003-25
LAMP-TR-100
A Generative Probabilistic OCR Model for NLP Applications
In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system.
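The noisy-channel framing can be summarized as recovering the most probable true text given the observed OCR output (standard noisy-channel notation, not reproduced from the paper):

```latex
\hat{t} = \arg\max_{t} \; P(t)\, P(o \mid t)
```

Here \(P(t)\) is a language model over true text and \(P(o \mid t)\) is the channel model describing how the OCR system corrupts it.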